arXiv:1506.04443v1 [q-bio.QM] 14 Jun 2015


PROBABILISTIC APPROACH FOR EVALUATING METABOLITE SAMPLE INTEGRITY

BARRY M. SLAFF, SHANE T. JENSEN, AND AALIM M. WELJIE 


Abstract. The success of metabolomics studies depends upon the “fitness” of each biological 
sample used for analysis: it is critical that metabolite levels reported for a biological sample represent
an accurate snapshot of the studied organism’s metabolite profile at time of sample collection.
Numerous factors may compromise metabolite sample fitness, including chemical and biological 
factors which intervene during sample collection, handling, storage, and preparation for analysis. 
We propose a probabilistic model for the quantitative assessment of metabolite sample fitness. 
Collection and processing of nuclear magnetic resonance (NMR) and ultra-performance liquid
chromatography-mass spectrometry (UPLC-MS) metabolomics data are discussed. Feature selection methods utilized
for multivariate data analysis are briefly reviewed, including feature clustering and computation 
of latent vectors using spectral methods. We propose that the time-course of metabolite changes 
in samples stored at different temperatures may be utilized to identify changing-metabolite-to- 
stable-metabolite ratios as markers of sample fitness. Tolerance intervals may be computed to 
characterize these ratios among fresh samples. In order to discover additional structure in the 
data relevant to sample fitness, we propose using data labeled according to these ratios to train a 
Dirichlet process mixture model (DPMM) for assessing sample fitness. DPMMs are highly intuitive
since they model the metabolite levels in a sample as arising from a combination of processes
including, e.g., normal biological processes and degradation- or contamination-inducing processes. 
The outputs of a DPMM are probabilities that a sample is associated with a given process, and 
these probabilities may be incorporated into a final classifier for sample fitness. 


1. Introduction 

Quantitative analysis of metabolite levels in biofluids and tissues has become a fruitful approach 
in medical and translational research. The success of such studies depends upon the “fitness” of 
each biological sample used for analysis. Specifically, it is critical that metabolite levels reported 
for a biological sample represent an accurate snapshot of the studied organism’s metabolite profile 
at time of sample collection. Between sample collection and final analysis, numerous factors may 
compromise metabolite sample fitness, including chemical (e.g. thermodynamic) and biological (e.g. 
bacterial) factors which intervene during sample collection, handling, storage, and preparation for 
analysis. We propose a probabilistic model for the quantitative assessment of metabolite sample 
fitness. The proposed model may be implemented as a computational tool which uses the measured
metabolite profile from a biological sample to estimate the sample’s fitness for inclusion in
metabolomics analyses. 

In this work, we present an approach for developing a probabilistic model of metabolite sample
fitness. In Section 2 we discuss the proposed analytical methods for collecting metabolite sample
data. The proposed analytical platforms are nuclear magnetic resonance (NMR) spectroscopy with a
targeted profiling approach for quantitation together with ultra-performance liquid chromatography
coupled to mass spectrometry (UPLC-MS). These are established methods for the acquisition of
quantitative metabolite data from biofluid and tissue samples. Additionally, we discuss the critical
data processing steps of normalization, centering, and scaling. Finally, we discuss approaches to
feature selection for the final model-building process.

In Section 3.1 we discuss the principle of metabolite sample fitness. Critical to the modeling process
is a quantitative definition of sample fitness: how should fitness be assessed? Existing studies suggest
the possibility of identifying metabolites that change significantly in concentration (“changing
metabolites”) and metabolites which remain relatively stable (“stable metabolites”) during sample
storage at different temperatures. We propose that at least one changing-metabolite-to-stable-metabolite
ratio might be identified for each biological matrix, and tolerance intervals will be computed
to characterize the range of ratios among fresh samples. A sample can then be deemed
fit or unfit based on whether its changing-metabolite-to-stable-metabolite ratios fall within the computed
tolerance intervals for fresh samples. While this method offers a straightforward computation,
it does not capture all the structure available in the data for differentiating fresh and degraded samples.
Therefore, we propose that data classified using tolerance intervals may be used to train a
probabilistic model capable of capturing additional structure in the data.

In Section 3.2 we propose approaches for modeling metabolite sample fitness based on the principle
discussed in Section 3.1. Key to our approach is development of a model which avoids incorrect
parametric assumptions. Our main modeling approach is development of a Dirichlet process mixture 
model (DPMM) for each biological matrix for which we wish to assess sample fitness. DPMMs are 
highly intuitive since they model the metabolite levels in a sample as arising from a combination 
of processes including, e.g., normal biological processes and degradation- or contamination-inducing 
processes. The outputs of a DPMM are probabilities that a sample is associated with a given 
process, and these probabilities will be incorporated into a final classifier for sample fitness. We also 
consider the use of conceptually simple non-parametric methods such as k-Nearest-Neighbor and 
kernel regression. 


2. Model Inputs: Data Acquisition and Processing 

Nuclear Magnetic Resonance (NMR) Spectroscopy and Ultra-Performance Liquid Chromatography
- Mass Spectrometry (UPLC-MS) are state-of-the-art analytical platforms for the acquisition of
quantitative metabolite data from urine, plasma, serum, tissue, and other biological matrices [?].
Quantitative data obtained using NMR and UPLC-MS will be utilized to model and assess metabolite
sample integrity.


2.1. Data Acquisition. Experimental design and data acquisition in metabolomics must avoid
contamination of the data with systemic errors and variances that can compromise analyses [?].
Targeted profiling [?] with NMR spectroscopy produces quantitative metabolite concentrations which
are reproducible within [?] and between [?] labs, with more variation when NMR probes
and experimental parameters are not consistent [13]. A Design of Experiments (DoE) approach
together with UPLC-MS yields reproducible metabolite data in quantitative and non-quantitative
approaches [?]. Appropriate measures will be taken in the experimental design to avoid biases
arising from test subject selection and temporal factors (e.g. time of day of sample collection).
Sample preparation and storage procedures will be tightly controlled apart from those varied deliberately
as part of the experimental procedure for inducing sample degradation. Our investigation of
metabolite sample integrity will inform existing domain knowledge regarding proper sample preparation
and storage for metabolomics studies [?] and biobanks [24][26].

2.2. Data Processing. Normalization, centering, and scaling are essential steps for metabolomics
data processing. Normalization is a critical step for purposes of comparing spectra acquired from
samples with different levels of dilution [27][?]. This concern is particularly critical in the case of
urine samples [?]. The simplest normalization method is total integral normalization (TIN), which
assumes that all spectral peaks scale with sample dilution. We also consider probabilistic quotient
normalization [31] (PQN), also called median-fold change normalization [32] (MFC), which scales
the peaks in each spectrum by one factor per spectrum so that the median fold-change between the
peaks in each spectrum and corresponding peaks in a reference spectrum is 1. In contrast to
TIN, PQN/MFC assumes that most rather than all spectral peaks scale with sample
dilution. We also consider an additional normalization step which would minimize the distance
from the sample vector to a modeled probability distribution. The metric used to evaluate nearness
might be Euclidean distance as a default, or for example Mahalanobis distance [33] if the clusters
are modeled as multivariate Gaussians. In the case of Gaussian clusters with diagonal covariance,
this computation is analytically simple [?] and it is analytically or numerically computable in other
cases. TIN and PQN/MFC are widely used and have been studied
comparatively in the context of both NMR [?] and LC-MS [28] metabolomics data. It has been
shown that use of TIN or PQN/MFC improves the results of comparing spectra relative to no normalization,
while the optimal method varies between contexts [27][?].
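
The two widely used normalizations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: spectra are plain lists of non-negative peak intensities aligned to a common reference, the function names are hypothetical, and the example intensities are invented for demonstration.

```python
from statistics import median

def total_integral_normalize(spectrum):
    """Scale a spectrum so that its peak intensities sum to 1."""
    total = sum(spectrum)
    return [x / total for x in spectrum]

def pqn_normalize(spectrum, reference):
    """Probabilistic quotient (median fold-change) normalization.

    Divides the whole spectrum by the median of the peak-wise quotients
    spectrum[j] / reference[j], so that the median fold-change becomes 1.
    Peaks that are zero in either spectrum are left out of the median.
    """
    quotients = [x / r for x, r in zip(spectrum, reference) if x > 0 and r > 0]
    factor = median(quotients)
    return [x / factor for x in spectrum]

# Invented example: a 2x-diluted copy of the reference in which one peak
# (index 3) has also changed for reasons unrelated to dilution.
reference = [10.0, 20.0, 30.0, 40.0, 50.0]
diluted = [5.0, 10.0, 15.0, 80.0, 25.0]

tin = total_integral_normalize(diluted)
pqn = pqn_normalize(diluted, reference)
```

In this example PQN restores the four unchanged peaks to their reference values because it relies on the median quotient, whereas the total integral is pulled by the single changed peak.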

Prior to model-building with training data and classification with new data, the data may be mean- 
centered and scaled feature-by-feature. Scaling assumes that the features with the most variance 
are not necessarily the features with the most predictive value, since relatively abundant features 
tend to have greater variance. Several scaling methods widely used in metabolomics studies [35][36]
include auto-scaling (each feature in the training data is scaled to have variance 1), Pareto scaling 
(each feature scaled so that its variance is the square-root of its initial variance), Variable stability 
(VAST) scaling [37] (features with smaller coefficient of variation are given more weight), and range 
scaling (each feature is scaled by its full range). The optimal scaling approach has been found to be 
highly context-dependent, and the results of modeling depend significantly upon the scaling method 
utilized [35][36].
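
A minimal sketch of the four scaling methods, applied to one feature (one column of the training data) at a time, might look as follows. The function name and the example values are hypothetical, and the VAST branch assumes a non-zero feature mean.

```python
from statistics import mean, stdev

def scale_feature(values, method="auto"):
    """Mean-center one feature and scale it by the chosen method."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    if method == "auto":
        divisor = s                      # resulting variance is 1
    elif method == "pareto":
        divisor = s ** 0.5               # resulting variance is s
    elif method == "vast":
        divisor = s * (s / m)            # auto-scaling times mean/sd (1 / CV)
    elif method == "range":
        divisor = max(values) - min(values)
    else:
        raise ValueError(f"unknown scaling method: {method}")
    return [(v - m) / divisor for v in values]

# Invented example feature.
values = [2.0, 4.0, 6.0, 8.0]
auto = scale_feature(values, "auto")
pareto = scale_feature(values, "pareto")
vast = scale_feature(values, "vast")
ranged = scale_feature(values, "range")
```

When classifying new data, the mean and divisor would normally be computed on the training data and reused unchanged for the new samples.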

2.3. Feature Selection. It is often advantageous in data analysis contexts to utilize a subset of 
the acquired features or generate new features for the final modeling task. For example, eliminating 
irrelevant or noisy features can improve the predictive performance of any final model. Additionally, 
generating new features which are highly relevant to the prediction can improve model performance. 

With respect to choosing subsets of acquired features for modeling, we consider the following: 

(1) Since the consistently-detectable metabolites are known to differ across matrices for NMR [?]
and LC-MS [3], features should be selected on a per-matrix basis, i.e. one feature set for
human urine, one feature set for human serum, etc. The full panel of reliably-detectable 
metabolites will be profiled initially. 

(2) A Design-of-Experiments (DoE) approach [?][33][39] may be used to experimentally narrow
the matrix-specific feature lists. For example, metabolites common to the same molecular 
pathways may exhibit high co-linearity, which would be detectable with a DoE approach. 

(3) In the case of a parametrized clustering approach, it may be useful to include only features
which satisfy certain parametric constraints. For example, in a model which constructs
multivariate Gaussian clusters, it may be of interest to maximize the Gaussian character
of the joint distributions through feature selection. Multivariate Gaussian character can be
assessed using an R package such as MVN [?]. One way to pursue this is to select only
features (possibly after transformation) with univariate Gaussian character, which can be
assessed with a univariate test for normality such as Lilliefors [6, 41][42] with an R package
such as nortest [?].

It may be desirable to combine the initial metabolite features into new features for purposes of modeling
metabolite sample fitness. Numerous feature selection methods have been utilized in omics
studies, in particular in domains for which studies typically include many more features than samples
[44][45]. Possible feature-combination approaches for modeling metabolite sample fitness include:

(1) Components from unsupervised spectral methods: utilizing principal component analysis
(PCA) vectors (equivalently, singular vectors and values) or independent component analysis
(ICA) vectors as features [45].

(2) Latent vectors: utilizing latent vectors from partial least squares regression [47] (PLS), 
canonical correlation analysis [45] (CCA), or related spectral methods such as orthogonal 
PLS [49] (O-PLS) and O-PLS with discriminant analysis [50] for classification (O-PLS-DA). 
The orthogonalized methods remove systemic variation in the data unrelated to the response 
variables to improve interpretability. O2-PLS [51] additionally yields two-way information
about covariation and predictiveness between the observed and response variables. For these
methods, the response variables could be, for example, indicators of sample degradation such
as time of sample exposure to non-freezing storage temperature. These methods have been
used widely for metabolomics data analysis [21][52][53].

(3) Feature clustering: The method of shrunken centroids [54], which originated as a feature
selection method in genomics, has been applied in a metabolomics context [55][?] for choosing
a subset of highly representative features. Other clustering approaches involve using
mutual information [57] or using a graph-theoretic approach [58] to identify clusters and
choose representative metabolite features.

(4) Multiple-testing framework for discovering significant features: kernelized support vector
machines [?], k-nearest-neighbor [?], and classification trees [?] have been used together in
a multiple testing framework to identify individual metabolite features [52]. This approach
is not widely used in the metabolomics literature but is attractive particularly since the false 
discovery rate can be controlled via the multiple testing framework. 
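
To illustrate approach (1) in the list above, the following sketch extracts the leading principal component of a small data matrix by power iteration and uses the projections onto it as a new feature. It is a standard-library illustration only: the function name and data are hypothetical, and it assumes the data have non-zero variance and that the start vector is not orthogonal to the leading direction. A dedicated linear algebra library would normally be used in practice.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return the leading PCA direction and the sample scores along it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # arbitrary non-zero start vector
    for _ in range(iterations):
        # Multiply by the covariance matrix and renormalize.
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Invented example: two strongly co-linear metabolite features.
rows = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0], [5.0, 10.1]]
direction, scores = first_principal_component(rows)
```

Here the single score per sample summarizes both co-linear features, which is the sense in which spectral components can replace the original features in a downstream model.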

3. Modeling Sample Fitness 

We wish to distinguish fit samples from unfit samples. In principle, a sample is fit for analysis if
its measurable metabolite levels at time of analysis are very similar to its measurable metabolite 
levels at time of collection. According to this principle, a sample is fit if at time of measurement, 
it accurately captures a collection-time snapshot of the studied organism’s metabolite profile. The 
original metabolite levels change over time due to intervention from chemical and biological factors 
during sample handling, storage, and preparation for analysis. Therefore, our central problems are 
the following: 
(1) Quantify the degree of change in metabolite levels after collection for which the sample
should no longer be considered “fit” for analysis. We follow the recommendation of Fraser
et al. [13], now widely adopted for quality control reporting in clinical medicine [?], and
define three levels of sample fitness: optimal, desirable, and minimal.

(2) Construct a probabilistic model of metabolite sample fitness so that a sample can be accurately
categorized as fit (optimal, desirable, or minimal) or unfit based upon its reported
metabolite levels. The model will be trained on data for which we have followed the time-
course of metabolite level changes from the absolutely fresh state to various states of degradation.
The result of (1) will be used to label each training data point as fit (optimal,
desirable, or minimal) or unfit. The trained probabilistic model will be used to predict the 
fitness or non-fitness of samples for which we have only one measurement of metabolite levels. 

Problem (1) is the subject of Section 3.1 and (2) is the subject of Section 3.2.

3.1. Principle of Metabolite Sample Fitness. Recent studies identify urine, blood, and plasma
metabolites that change concentration significantly over hours and days in response to storage at
above-freezing temperatures [?]. This degradation process transforms a sample from a state
of fitness to unfitness. We hypothesize that ratios between changing metabolites and stable metabolites
during storage can identify fresh vs. degraded samples. For this purpose we propose the use
of tolerance intervals [65][66] for characterizing changing-to-stable metabolite ratios in optimally-fit,
desirably-fit, and minimally-fit samples. For each biological matrix, at least one changing/stable
metabolite pair should be identified from the available data. We propose the following taxonomy:


(1) An “optimally” fit sample is one which falls inside a tolerance interval containing 80% of 
the fresh-sample ratios with 95% confidence (i.e., .80-content, .95-coverage TI). 

(2) A “desirably” fit sample is one which falls inside a tolerance interval containing 95% of the 
fresh-sample ratios with 95% confidence (i.e., .95-content, .95-coverage TI) and which is not 
“optimally” fit. 

(3) A “minimally” fit sample is one which falls inside a tolerance interval containing 99% of the 
fresh-sample ratios with 95% confidence (i.e., .99-content, .95-coverage TI) and which is not 
“desirably” fit. 


Tolerance intervals may be computed for data arising from approximately normal [67] or non-normal
distributions [?]. In the case of approximately normal distributions, the two-sided p-content,
γ-coverage tolerance interval is

\[ \left[\, \bar{Y} - k s,\; \bar{Y} + k s \,\right], \]

where \bar{Y} is the sample mean, s is the sample standard deviation, and k is computed as:

\[ k = \sqrt{\frac{\nu \left(1 + \frac{1}{N}\right) z^{2}_{(1-p)/2}}{\chi^{2}_{1-\gamma,\,\nu}}}, \tag{1} \]

where N is the sample size, \chi^{2}_{1-\gamma,\nu} is the critical value of the chi-square distribution with
degrees of freedom ν (usually ν = N - 1) that is exceeded with probability γ, and z_{(1-p)/2} is the
critical value of the normal distribution associated with cumulative probability (1 - p)/2.
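
The taxonomy and Equation (1) can be combined into a short grading routine. The sketch below is illustrative only: it assumes approximately normal fresh-sample ratios, it replaces an exact chi-square quantile with the Wilson-Hilferty approximation so that only the Python standard library is needed (an assumption of this sketch, not part of the proposal), and the function names and example ratios are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def chi2_quantile(prob, dof):
    """Approximate chi-square quantile with lower-tail probability `prob`
    (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(prob)
    a = 2.0 / (9.0 * dof)
    return dof * (1.0 - a + z * a ** 0.5) ** 3

def tolerance_factor(n, content, confidence):
    """Two-sided k factor of Equation (1), with p = content, gamma = confidence."""
    dof = n - 1
    z = NormalDist().inv_cdf((1.0 - content) / 2.0)
    # The critical value exceeded with probability gamma is the quantile
    # with lower-tail probability 1 - gamma.
    chi2 = chi2_quantile(1.0 - confidence, dof)
    return (dof * (1.0 + 1.0 / n) * z * z / chi2) ** 0.5

def fitness_grade(ratio, fresh_ratios, confidence=0.95):
    """Grade a ratio against nested tolerance intervals of fresh samples."""
    centre, spread, n = mean(fresh_ratios), stdev(fresh_ratios), len(fresh_ratios)
    for grade, content in (("optimal", 0.80), ("desirable", 0.95), ("minimal", 0.99)):
        k = tolerance_factor(n, content, confidence)
        if abs(ratio - centre) <= k * spread:
            return grade
    return "unfit"

# Invented changing/stable ratios measured in ten fresh samples.
fresh = [0.98, 1.02, 1.00, 0.97, 1.03, 1.01, 0.99, 1.00, 1.02, 0.98]
grades = [fitness_grade(r, fresh) for r in (1.01, 1.06, 1.08, 1.20)]
```

Because the intervals are nested, checking them from narrowest to widest returns the most favorable grade a sample qualifies for, matching the "and which is not" clauses of the taxonomy.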


Changing-to-stable metabolite ratios offer a straightforward method for identifying fit and unfit 
samples. However, this method is not comprehensive: it may not identify all structure present 
in the available data for differentiating fresh and degraded samples. We therefore propose that
changing-to-stable metabolite ratios may be used to train a probabilistic model which is capable of
capturing additional structure in the data. We propose to utilize a Dirichlet process mixture model
for this purpose (see Section 3.2.2) and also consider the use of k-Nearest-Neighbor and kernel
regression methods (see Section 3.2.1).

3.2. Modeling Methods. We discuss two main approach types for modeling metabolite sample
fitness. The first approach type includes k-nearest neighbors (kNN) and kernel regression, two conceptually
simple approaches not widely utilized for classification in the metabolomics literature.
They offer an intuitive way to classify fit and unfit samples, since fit samples should be more
similar to other fit samples than to unfit samples. However, kNN and kernel regression do not identify
important relationships between significant features and are highly susceptible to misclassification
due to contributions from insignificant features. Hence we also present a second approach.

The second approach type is that of Dirichlet process mixture models (DPMM), a non-parametric
Bayesian approach [?]. Despite the “non-parametric” descriptor, a key advantage of such models
is not their absence of parameters but their flexible parametric form: the parametric form of the
probability distributions is inferred along with the parameter values during model training.
Characterization of metabolite sample fitness according to DPMM is highly intuitive because we regard
sample feature measurements as arising from a combination of biological and chemical processes,
including freshness (ideal sample collection and measurement) and degradation mechanisms, both
chemical (e.g. thermodynamic) and biological (e.g. bacterial), acting over varying lengths of time. Sample
fitness may then be assessed according to the estimated probability of each process having generated
a given sample.


3.2.1. k-Nearest Neighbors and Kernel Regression. The k-Nearest Neighbor (kNN) classifier [?]
is one of the simplest classifiers conceptually and has only one parameter, k. Given training samples
and a new test sample, it classifies the new sample in the majority category of its k nearest
training-sample neighbors. The nearness is computed according to a metric. Some metrics utilized
for nearest neighbor classification include [?]:

(1) L2 norm: normal Euclidean distance. The distance between two points is the square root of
the sum of squared differences between the features.

(2) L1 norm: The distance between two points is the sum of the absolute values of differences
between the features.

(3) L∞ norm: The distance between two points is the greatest absolute difference
between any pair of corresponding features.

The optimal k may be determined using cross-validation. An advantage to using kNN for modeling
fit and unfit samples is that it makes no parametric assumptions about the probability distributions
of the sample groups or about the way in which those groups separate (in contrast to, for example,
a hyperplane-separation method). However, since all features are treated equally in computing
the distance between two points, the method is prone to mis-classification without exceptionally
careful feature selection. K-nearest-neighbor classifiers have been discussed in the metabolomics
literature but are rarely utilized for classification, in part due to their relatively large computational
expense [?].
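
The kNN procedure, the three metrics, and the choice of k by cross-validation can be sketched as follows. This is a small standard-library illustration, not a proposed implementation: the function names, the leave-one-out scheme, and the two-feature example data are assumptions made here for demonstration.

```python
from collections import Counter

def distance(a, b, norm="l2"):
    """L1, L2 or L-infinity distance between two feature vectors."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if norm == "l1":
        return sum(diffs)
    if norm == "l2":
        return sum(d * d for d in diffs) ** 0.5
    if norm == "linf":
        return max(diffs)
    raise ValueError(f"unknown norm: {norm}")

def knn_predict(train_x, train_y, x, k, norm="l2"):
    """Majority label among the k nearest training samples.
    Ties go to the label that appears first among the nearest neighbors."""
    ranked = sorted(zip(train_x, train_y), key=lambda p: distance(p[0], x, norm))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

def loo_accuracy(train_x, train_y, k, norm="l2"):
    """Leave-one-out cross-validated accuracy for a given k."""
    hits = 0
    for i in range(len(train_x)):
        rest_x = train_x[:i] + train_x[i + 1:]
        rest_y = train_y[:i] + train_y[i + 1:]
        hits += knn_predict(rest_x, rest_y, train_x[i], k, norm) == train_y[i]
    return hits / len(train_x)

# Invented, already scaled training data with two features per sample.
train_x = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.0],
           [3.0, 3.1], [2.9, 3.0], [3.2, 2.8], [3.1, 3.0]]
train_y = ["fit"] * 4 + ["unfit"] * 4

best_k = max((1, 3, 5), key=lambda k: loo_accuracy(train_x, train_y, k))
prediction = knn_predict(train_x, train_y, [1.05, 1.0], best_k)
```

Every prediction requires a distance to each training sample, which is the source of the computational expense noted above.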

An extension of kNN is kernel regression [?]. kNN assumes that when classifying a new
sample, each of the k neighboring training samples should receive equal consideration and the
number of neighbors (k) should be the same for classifying any new sample. Kernel regression relaxes those assumptions by
defining a kernel which gives different weight to each training sample in the classification. Generally
speaking, a kernel is a similarity function which maps pairs of data points to a number; more similar
(“nearer”) data points should be mapped to larger numbers. One widely-used kernel is a Gaussian
kernel, which gives stronger weight to nearer training samples and less weight to farther-away training
samples according to an exponential fall-off. For Gaussian kernel regression, the prediction y for a
new sample x is

\[ y(x) = \operatorname*{arg\,max}_{y \in Y} \sum_{i=1}^{N} I\big(f(x_i) = y\big)\, K(x_i, x), \tag{2} \]

where y is a class identifier, Y is the set of classes, N is the number of training samples, f maps a
training sample to its class identifier, I is the indicator function (I = 1 if the argument is true, I = 0
otherwise), and K is the kernel function, in this case the Gaussian kernel [71]:

\[ K(x_i, x) = \frac{1}{\sqrt{(2\pi)^{k} \det(\Sigma)}} \exp\left[ -\frac{1}{2} (x_i - x)^{T} \Sigma^{-1} (x_i - x) \right], \tag{3} \]

where k is the vector dimension of x (i.e. the number of features) and Σ is a k × k covariance matrix.
Σ is a free parameter which may be determined by cross-validation or computed from the training
data, e.g.

\[ \Sigma = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^{T}, \tag{4} \]

where \bar{x} is the average of the training samples.

Many kernels have their own parameters which must be determined by cross-validation; for example,
the Gaussian kernel has a “spread” or “standard deviation” parameter σ governing how fast
the exponential falls off. As the number of parameters increases, so does the risk for over-fitting the
prediction model.
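
A minimal sketch of the kernel-weighted vote in Equation (2) is given below. To keep it short it assumes an isotropic covariance, i.e. the covariance matrix of Equation (3) is taken to be σ² times the identity, so that σ is the only kernel parameter; that simplification, the function names, and the example data are assumptions of this sketch.

```python
import math

def gaussian_kernel(xi, x, sigma):
    """Gaussian kernel of Equation (3) with covariance sigma^2 * I."""
    k = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, x))
    # det(sigma^2 * I) = sigma^(2k)
    norm_const = math.sqrt((2.0 * math.pi) ** k * sigma ** (2 * k))
    return math.exp(-0.5 * sq_dist / sigma ** 2) / norm_const

def kernel_classify(train_x, train_y, x, sigma=1.0):
    """Equation (2): return the class with the largest summed kernel weight,
    together with the per-class sums."""
    votes = {}
    for xi, yi in zip(train_x, train_y):
        votes[yi] = votes.get(yi, 0.0) + gaussian_kernel(xi, x, sigma)
    return max(votes, key=votes.get), votes

# Invented, already scaled training data with two features per sample.
train_x = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.0],
           [3.0, 3.1], [2.9, 3.0], [3.2, 2.8], [3.1, 3.0]]
train_y = ["fit"] * 4 + ["unfit"] * 4

label, votes = kernel_classify(train_x, train_y, [1.3, 1.2], sigma=0.5)
```

Unlike kNN, every training sample contributes to the vote, but distant samples contribute very little; a smaller σ makes the classifier more local and a larger σ makes it smoother.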


3.2.2. Dirichlet Process Mixture Model. We can conceptualize the sample fitness assessment problem
as follows: we assume that samples are generated by a combination of distinct processes (i.e.
probability distributions) and consider those processes to be hidden variables in a mixture model.
For example, we might consider a human urine sample’s metabolite levels to result from processes
such as freshness (ideal fresh urine sample collection), bacterial contamination (due to exposure to
above-freezing storage temperatures) over varying lengths of time, and chemical interactions within
the urine (e.g. breakdown of thermodynamically unstable compounds or slow reactions) over varying
lengths of time. Our proposed experimental procedure includes collection and measurement
of biofluid metabolite levels after variable-time exposure to a range of storage temperatures. We 
propose to model the resulting data using a Dirichlet process mixture model (DPMM). To assess 
the fitness of a new sample, the DPMM will estimate the maximum posterior probability that the 
sample arose from each process. In this way we obtain a quantitative estimate of the degree to which 
a sample is generated from, e.g., freshness vs. various degradation processes. To make a sample 
fitness determination, we consider the following possibilities: 
(1) Based on the principle of sample fitness (see Section 3.1), the training samples may be used
to establish a threshold for the probability of “fresh” process generation which constitutes 
a fit sample. Hence the final classifier predicts that the sample is fit or unfit based on the 
probability that the sample arose from the “fresh” process. 

(2) However, more than one process may be associated with sample “freshness”. Therefore it 
may be desirable to use the process probabilities computed by the DPMM as inputs to a 
final classifier for sample fitness. For this procedure, we will explore the use of classifiers 
such as kNN, O-PLS-DA [50] , and hyperplane methods such as support vector machine [59] 
or soft independent modeling of class analogy [75] (SIMCA). Hence the final classifier assesses
sample fitness using the full vector of process probabilities computed by the DPMM. 
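
Possibility (1) can be illustrated with a small sketch of how per-process probabilities might be turned into a fitness decision. It does not fit a DPMM: it assumes the fitted mixture has already been summarized as a finite list of one-dimensional Gaussian components, and the component names, weights, means, standard deviations, and the 0.9 threshold are all hypothetical values chosen for demonstration.

```python
from statistics import NormalDist

# Hypothetical fitted components for one changing/stable ratio:
# (process name, mixture weight, mean, standard deviation).
components = [
    ("fresh", 0.60, 1.0, 0.05),
    ("bacterial", 0.25, 1.6, 0.20),
    ("chemical", 0.15, 0.6, 0.10),
]

def process_probabilities(x, components):
    """Posterior probability of each process given x, by Bayes' rule."""
    joint = {name: w * NormalDist(mu, sd).pdf(x) for name, w, mu, sd in components}
    total = sum(joint.values())  # would be zero only for extreme outliers
    return {name: value / total for name, value in joint.items()}

def is_fit(x, components, threshold=0.9):
    """Declare a sample fit if the 'fresh' process is sufficiently probable."""
    return process_probabilities(x, components)["fresh"] >= threshold

probs = process_probabilities(1.02, components)
decisions = [is_fit(x, components) for x in (1.0, 1.7, 0.6)]
```

For possibility (2), the full dictionary returned by `process_probabilities` would instead be passed as a feature vector to a downstream classifier.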

The advantage of using a DPMM lies in its parametric flexibility: the priors for each process are 
defined over a space of functions, and the appropriate parametric representation of each process is 
inferred together with the parameter values. In contrast, for more strongly-parametrized approaches 
such as a Gaussian Mixture Model, the parametric representation of each process is assumed and 
only the parameter values are computed in model training. 
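
One way to see why the number of processes need not be fixed in advance is the Chinese restaurant process, a standard description of the prior over sample-to-cluster assignments induced by a Dirichlet process. The sketch below draws one such assignment; it illustrates the prior only, not the proposed model or its inference, and the function name, the concentration value `alpha`, and the seed are assumptions made here.

```python
import random

def chinese_restaurant_process(n, alpha, seed=0):
    """Draw one partition of n samples from the Chinese restaurant process."""
    rng = random.Random(seed)
    counts = []        # counts[c] = number of samples already in cluster c
    assignments = []
    for _ in range(n):
        # An existing cluster c is chosen with weight counts[c];
        # a new cluster is opened with weight alpha.
        weights = counts + [alpha]
        c = rng.choices(range(len(weights)), weights=weights)[0]
        if c == len(counts):
            counts.append(1)
        else:
            counts[c] += 1
        assignments.append(c)
    return assignments, counts

assignments, counts = chinese_restaurant_process(50, alpha=1.0)
```

The number of clusters in `counts` is an outcome of the draw rather than an input, whereas a finite Gaussian mixture would require it to be specified before training.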